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ABSTRACT 


In tree-based multicast systems, a relatively small number of 
interior nodes carry the load of forwarding multicast mes- 
sages. This works well when the interior nodes are highly- 
available, dedicated infrastructure routers but it poses a prob- 
lem for application-level multicast in peer-to-peer systems. 
SplitStream addresses this problem by striping the content 
across a forest of interior-node-disjoint multicast trees that 
distributes the forwarding load among all participating peers. 
For example, it is possible to construct efficient SplitStream 
forests in which each peer contributes only as much forward- 
ing bandwidth as it receives. Furthermore, with appropri- 
ate content encodings, SplitStream is highly robust to fail- 
ures because a node failure causes the loss of a single stripe 
on average. We present the design and implementation of 
SplitStream and show experimental results obtained on an 
Internet testbed and via large-scale network simulation. The 
results show that SplitStream distributes the forwarding load 
among all peers and can accommodate peers with different 
bandwidth capacities while imposing low overhead for forest 
construction and maintenance. 
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1. INTRODUCTION 


End-system or application-level multicast [8, 16, 22, 19, 
40, 32, 13, 6, 23] has become an attractive alternative to IP 
multicast. Instead of relying on a multicast infrastructure in 
the network (which is not widely available), the participat- 
ing hosts route and distribute multicast messages using only 
unicast network services. In this paper, we are particularly 
concerned with application-level multicast in peer-to-peer 
(p2p) or cooperative environments where peers contribute 
resources in exchange for using the service. 

Unfortunately, conventional tree-based multicast is inher- 
ently not well matched to a cooperative environment. The 
reason is that in any multicast tree, the burden of duplicat- 
ing and forwarding multicast traffic is carried by the small 
subset of the peers that are interior nodes in the tree. The 
majority of peers are leaf nodes and contribute no resources. 
This conflicts with the expectation that all peers should 
share the forwarding load. The problem is further aggra- 
vated in high-bandwidth applications, like video or bulk file 
distribution, where many peers may not have the capacity 
and availability required of an interior node in a conventional 
multicast tree. SplitStream addresses these problems by en- 
abling efficient cooperative distribution of high-bandwidth 
content in a peer-to-peer system. 

The key idea in SplitStream is to split the content into k 
stripes and to multicast each stripe using a separate tree. 
Peers join as many trees as there are stripes they wish to 
receive and they specify an upper bound on the number of 
stripes that they are willing to forward. The challenge is to 
construct this forest of multicast trees such that an interior 
node in one tree is a leaf node in all the remaining trees and 
the bandwidth constraints specified by the nodes are satis- 
fied. This ensures that the forwarding load can be spread 
across all participating peers. For example, if all nodes wish 
to receive k stripes and they are willing to forward k stripes, 
SplitStream will construct a forest such that the forwarding 
load is evenly balanced across all nodes while achieving low 
delay and link stress across the system. 

Striping across multiple trees also increases the resilience 
to node failures. SplitStream offers improved robustness to 
node failure and sudden node departures like other systems 
that exploit path diversity in overlays [4, 35, 3, 30]. Split- 


Stream ensures that the vast majority of nodes are interior 
nodes in only one tree. Therefore, the failure of a single 
node causes the temporary loss of at most one of the stripes 
(on average). With appropriate data encodings, applica- 
tions can mask or mitigate the effects of node failures even 
while the affected tree is being repaired. For example, ap- 
plications can use erasure coding of bulk data [9] or multiple 
description coding (MDC) of streaming media [27, 4, 5, 30]. 

The key challenge in the design of SplitStream is to con- 
struct a forest of multicast trees that distributes the for- 
warding load subject to the bandwidth constraints of the 
participating nodes in a decentralized, scalable, efficient and 
self-organizing manner. SplitStream relies on a structured 
peer-to-peer overlay [31, 36, 39, 33] to construct and main- 
tain these trees. We implemented a SplitStream prototype 
and evaluated its performance. We show experimental re- 
sults obtained on the PlanetLab [1] Internet testbed and 
on a large-scale network simulator. The results show that 
SplitStream achieves these goals. 

The rest of this paper is organized as follows. Section 2 
outlines the SplitStream approach in more detail. A brief de- 
scription of the structured overlay is given in Section 3. We 
present the design of SplitStream in Section 4. The results 
of our experimental evaluation are presented in Section 5. 
Section 6 describes related work and Section 7 concludes. 


2. THE SPLITSTREAM APPROACH 


In this section, we give a detailed overview of SplitStream’s 
approach to cooperative, high-bandwidth content distribu- 
tion. 


2.1 Tree-based multicast 


In all multicast systems based on a single tree, a partici- 
pating peer is either an interior node or a leaf node in the 
tree. The interior nodes carry all the burden of forward- 
ing multicast messages. In a balanced tree with fanout f 


and height h, the number of interior nodes is £ a and the 
number of leaf nodes is f”. Thus, the fraction of leaf nodes 
increases with f. For example, more than half of the peers 
are leaves in a binary tree, and over 90% of peers are leaves 
in a tree with fanout 16. In the latter case, the forwarding 
load is carried by less than 10% of the peers. All nodes have 
equal inbound bandwidth, but the internal nodes have an 
outbound bandwidth requirement of 16 times their inbound 
bandwidth. Even in a binary tree, which would be impracti- 
cally deep in most circumstances, the outbound bandwidth 
required by the interior nodes is twice their inbound band- 
width. Deep trees are not practical because they are more 
fault prone and they introduce larger delays, which is a prob- 
lem for some applications. 


2.2 SplitStream 


SplitStream is designed to overcome the inherently unbal- 
anced forwarding load in conventional tree-based multicast 
systems. SplitStream strives to distribute the forwarding 
load over all participating peers and respects different ca- 
pacity limits of individual peers. SplitStream achieves this 
by splitting the multicast stream into multiple stripes, and 
using separate multicast trees to distribute each stripe. 

Figure 1 illustrates how SplitStream balances the forward- 
ing load among the participating peers. In this simple ex- 
ample, the original content is split into two stripes and mul- 
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Figure 1: A simple example illustrating the basic 
approach of SplitStream. The original content is 
split into two stripes. An independent multicast tree 
is constructed for each stripe such that a peer is an 
interior node in one tree and a leaf in the other. 


ticast in separate trees. For simplicity, let us assume that 
the original content has a bandwidth requirement of B and 
that each stripe has half the bandwidth requirement of the 
original content. Each peer, other than the source, receives 
both stripes inducing an inbound bandwidth requirement of 
B. As shown in Figure 1, each peer is an internal node in 
only one tree and forwards the stripe to two children, which 
yields an outbound bandwidth of no more than B. 

In general, the content is split into k stripes. Participat- 
ing peers may receive a subset of the stripes, thus controlling 
their inbound bandwidth requirement in increments of B/k. 
Similarly, peers may control their outbound bandwidth re- 
quirement in increments of B/k by limiting the number of 
children they adopt. Thus, SplitStream can accommodate 
nodes with different bandwidths, and nodes with unequal 
inbound and outbound network capacities. 

This works well when the bandwidth bottleneck in the 
communication between two nodes is either at the sender 
or at the receiver. While this assumption holds in many 
settings, it is not universally true. If the bottleneck is else- 
where, nodes may be unable to receive all desired stripes. 
We plan to extend SplitStream to address this issue. For ex- 
ample, nodes could monitor the packet arrival rate for each 
stripe. If they detect that the incoming link for a stripe 
is not delivering the expected bandwidth, they can detach 
from the stripe tree and search for an alternate parent. 


2.3 Applications 


SplitStream provides a generic infrastructure for high- 
bandwidth content distribution. Any application that uses 
SplitStream controls how its content is encoded and divided 
into stripes. SplitStream builds the multicast trees for the 
stripes while respecting the inbound and outbound band- 
width constraints of the peers. Applications need to encode 
the content such that (i) each stripe requires approximately 
the same bandwidth, and (ii) the content can be recon- 
structed from any subset of the stripes of sufficient size. 

In order for applications to tolerate the loss of a subset of 
stripes, they may provide mechanisms to fetch content from 
other peers in the system, or may choose to encode content 
in a manner that requires greater than B/k per stripe, in 
return for the ability to reconstitute the content from less 
than k stripes. For example, a media stream could be en- 
coded using MDC so that the video can be reconstituted 


from any subset of the k stripes with video quality propor- 
tional to the number of stripes received. Hence, if an interior 
node in a stripe tree should fail then clients deprived of the 
stripe are able to continue displaying the media stream at 
reduced quality until the stripe tree is repaired. Such an en- 
coding also allows low-bandwidth clients to receive the video 
at lower quality by explicitly requesting less stripes. 

Another example is the multicasting of file data with era- 
sure coding [9]. Each data block is encoded using erasure 
codes to generate k blocks such that only a (large) subset 
of the k blocks is required to reconstitute the original block. 
Each stripe is then used to multicast a different one of the k 
blocks. Participants receive all stripes and once a sufficient 
subset of the blocks is received the clients are able to recon- 
stitute the original data block. If a client misses a number 
of blocks from a particular stripe for a period of time (while 
the stripe multicast tree is being repaired after an internal 
node has failed) the client can still reconstitute the original 
data blocks due to the redundancy. An interesting alter- 
native is the use of rateless codes [24, 26], which provide 
a simple approach to coordinating redundancy, both across 
stripes and within each stripe. 

Applications also control when to create and tear down a 
SplitStream forest. Our experimental results indicate that 
the maximum node stress to construct a forest and distribute 
1 Mbyte of data is significantly lower than the node stress 
placed on a centralized server distributing the same data. 
Therefore, it is perfectly reasonable to create a forest to 
distribute a few megabytes of data and then tear it down. 
The results also show that the overhead to maintain a forest 
is low even with high churn. Therefore, it is also reasonable 
to create long-lived forests. The ideal strategy depends on 
the fraction of time that a forest is used to transmit data. 


2.4 Properties 


Next, we discuss necessary and sufficient conditions for 
the feasibility of forest construction by any algorithm and 
relate them with what SplitStream can achieve. 

Let N be the set of nodes and k be the number of stripes. 
Each node i € N wants to receive I; (0 < I; < k) distinct 
stripes and is willing to forward a stripe to up to Ci other 
nodes. We call J; the node’s desired indegree and C; its 
forwarding capacity. There is a set of source nodes (S C N) 
whose elements originate one or more of the k stripes (i.e., 
1 < |S| < k). The forwarding capacity Cs of each source 
node s € S must at least equal the number of stripes that s 
originates, Ts. 


DEFINITION 1. Given a set of nodes N and a set of sources 
SCN, forest construction is feasible if it is possible to con- 
nect the nodes such that each node i € N receives I; distinct 
stripes and has no more than C; children. 


The following condition is obviously necessary for the fea- 
sibility of forest construction by any algorithm. 


CONDITION 1. If forest construction is feasible, the sum 
of the desired indegrees cannot exceed the sum of the for- 
warding capacities: 


‘> [p< > Ci (1) 
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Condition 1 is necessary but not sufficient for the feasibil- 
ity of forest construction, as the simple example in Figure 2 


illustrates. The incoming arrows in each node in the figure 
correspond to its desired indegree and the outgoing arrows 
correspond to its forwarding capacity. The total forwarding 
capacity matches the total desired indegree in this example 
but it is impossible to supply both of the rightmost nodes 
with two distinct stripes. The node with forwarding capac- 
ity three has desired indegree one and, therefore, it can only 
provide the same stripe to all its children. 
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Figure 2: An example illustrating that condition 1 
is not sufficient to ensure feasibility of a SplitStream 
forest. 


Condition 2 prevents this problem. It is sufficient to en- 
sure feasibility because it prevents the concentration of for- 
warding capacity in nodes that are unable to forward all 
stripes. 


CONDITION 2. A sufficient condition for the feasibility of 
forest construction is for Condition 1 to hold and for all 
nodes whose forwarding capacity exceeds their desired inde- 
gree to receive or originate all k stripes, i.e., 


Vi: C> h> i+, =k. (2) 


This is a natural condition in a cooperative environment 
because nodes are unlikely to spend more resources improv- 
ing the quality of service perceived by others than on im- 
proving the quality of service that they perceive. Addition- 
ally, inbound bandwidth is typically greater than or equal 
to outbound bandwidth in consumer Internet connections. 

Given a set of nodes that satisfy Condition 2, the Split- 
Stream algorithm can build a forest with very high proba- 
bility provided there is a modest amount of spare capacity 
in the system. The probability of success increases with 
the minimum number of stripes that nodes receives, Imin, 
and the total amount of spare capacity, C = SMe ni 
X vien li: We derive the following rough upper bound on 
the probability of failure in Section 4.5: 


Tetin -2 

ain ET (3) 

As indicated by the upper bound formula, the probability 
of success is very high even with a small amount of spare 
capacity in the system. Additionally, we expect Imin to be 
large for most applications. For example, erasure coding for 
reliable distribution of data and MDC for video distribu- 
tion perform poorly if peers do not receive to most stripes. 
Therefore, we expect configurations where all peers receive 
all stripes to be common. In this case, the algorithm can 
guarantee efficient forest construction with probability one 
even if there is no spare capacity. 

In an open cooperative environment, it is important to 
address the issue of free loaders, which appear to be preva- 
lent in Gnutella [2]. In such an environment, it is desirable 
to strengthen Condition 1 to require that the forwarding ca- 
pacity of each node be greater than or equal to its desired 
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indegree (i.e., Vi € N : Ci > L). (This condition may be un- 
necessarily strong in more controlled settings, for example, 
in a corporate intranet.) Additionally, we need a mecha- 
nism to discourage free loading such that most participants 
satisfy the stronger condition. In some settings, it may be 
sufficient to have the SplitStream implementation enforce 
the condition in the local node. Stronger mechanisms may 
use a trusted execution platform like Microsoft’s Palladium 
or a mechanism based on incentives [28]. This is an inter- 
esting area of future work. 


3. BACKGROUND 


SplitStream is implemented using tree-based application- 


level multicast. There are many proposals for how application- 


level multicast trees can be built and maintained [8, 19, 40, 
13, 32, 22, 6, 23]. In this paper, we consider the implemen- 
tation of SplitStream using Scribe [13] and Pastry [33]. It 
could also be implemented using a different overlay protocol 
and group communication system; for example, Bayeux on 
Tapestry [39, 40] or Scribe on CAN [15]. Before describ- 
ing the SplitStream design, we provide a brief overview of 
Pastry and Scribe. 


3.1 Pastry 


Pastry is a scalable, self-organizing structured peer-to- 
peer overlay network similar to CAN [31], Chord [36], and 
Tapestry [39]. In Pastry, nodes and objects are assigned ran- 
dom identifiers (called nodeIds and keys, respectively) from 
a large id space. Nodelds and keys are 128 bits long and can 
be thought of as a sequence of digits in base 2° (b is a con- 
figuration parameter with a typical value of 3 or 4). Given 
a message and a key, Pastry routes the message to the node 
with the nodeld that is numerically closest to the key, which 
is called the key’s root. This simple capability can be used 
to build higher-level services like a distributed hash table 
(DHT) or an application-level group communication system 
like Scribe. 

In order to route messages, each node maintains a routing 
table and a leaf set. A node’s routing table has about logs» N 
rows and 2° columns. The entries in row r of the routing 
table refer to nodes whose nodelds share the first r digits 
with the local node’s nodeld. The (r + 1)th nodeld digit of 
a node in column c of row r equals c. The column in row 
r corresponding to the value of the (r + 1)th digit of the 
local node’s nodeld remains empty. At each routing step, a 
node normally forwards the message to a node whose nodeld 
shares with the key a prefix that is at least one digit longer 
than the prefix that the key shares with the present node’s 
id. If no such node is known, the message is forwarded to a 
node whose nodeld shares a prefix with the key as long as 
the current node’s nodeld but is numerically closer. Figure 3 
shows the path of an example message. 

Each Pastry node maintains a set of neighboring nodes in 
the nodeld space (called the leaf set), both to ensure reliable 
message delivery, and to store replicas of objects for fault 
tolerance. The expected number of routing hops is less than 
logy» N. The Pastry overlay construction observes proxim- 
ity in the underlying Internet. Each routing table entry is 
chosen to refer to a node with low network delay, among all 
nodes with an appropriate nodeld prefix. As a result, one 
can show that Pastry routes have a low delay penalty: the 
average delay of Pastry messages is less than twice the IP 
delay between source and destination [11]. Similarly, one 
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Figure 3: Routing a message from the node with 
nodeld 65alfc to key d46alc. The dots depict the 
nodelds of live nodes in Pastry’s circular namespace. 


can show the local route convergence of Pastry routes: the 
routes of messages sent to the same key from nearby nodes 
in the underlying Internet tend to converge at a nearby in- 
termediate node. Both of these properties are important for 
the construction of efficient multicast trees, described below. 
A full description of Pastry can be found in [33, 11, 12]. 


3.2 Scribe 


Scribe [13, 14] is an application-level group communica- 
tion system built upon Pastry. A pseudo-random Pastry key, 
known as the groupId, is chosen for each multicast group. A 
multicast tree associated with the group is formed by the 
union of the Pastry routes from each group member to the 
groupld’s root (which is also the root of the multicast tree). 
Messages are multicast from the root to the members using 
reverse path forwarding [17]. 

The properties of Pastry ensure that the multicast trees 
are efficient. The delay to forward a message from the root 
to each group member is low due to the low delay penalty of 
Pastry routes. Pastry’s local route convergence ensures that 
the load imposed on the physical network is small because 
most message replication occurs at intermediate nodes that 
are close in the network to the leaf nodes in the tree. 

Group membership management in Scribe is decentral- 
ized and highly efficient, because it leverages the existing, 
proximity-aware Pastry overlay. Adding a member to a 
group merely involves routing towards the groupId until the 
message reaches a node in the tree, followed by adding the 
route traversed by the message to the group multicast tree. 
As a result, Scribe can efficiently support large numbers of 
groups, arbitrary numbers of group members, and groups 
with highly dynamic membership. 

The latter property, combined with an anycast [14] prim- 
itive recently added to Scribe, can be used to perform dis- 
tributed resource discovery. Briefly, any node in the overlay 
can anycast to a Scribe group by routing the message to- 
wards the groupId. Pastry’s local route convergence ensures 
that the message reaches a group member near the mes- 
sage’s sender with high probability. A full description and 
evaluation of Scribe multicast can be found in [13]. Scribe 
anycast is described in [14]. 
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Figure 4: SplitStream’s forest construction. The 
source splits the content and multicasts each stripe 
in its designated tree. Each stripe’s stripeId starts 
with a different digit. The nodelds of interior nodes 
share a prefix with the stripeId, thus they must be 
leaves in the other trees, e.g., node M with a nodeld 
starting with 1 is an interior node in the tree for 
the stripeId starting with 1 and a leaf node in other 
trees. 


4. SPLITSTREAM DESIGN 


In this section, we describe the design of SplitStream. We 
begin with the construction of interior-node-disjoint trees 
for each of the stripes. Then, we discuss how SplitStream 
balances the forwarding capacity across nodes, such that the 
bandwidth constraints of each node are observed. 


4.1 Building interior-node-disjoint trees 


SplitStream uses a separate Scribe multicast tree for each 
of the k stripes. A set of trees is said to be interior-node- 
disjoint if each node is an interior node in at most one 
tree, and a leaf node in the other trees. SplitStream ex- 
ploits the properties of Pastry routing to construct interior- 
node-disjoint trees. Recall that Pastry normally forwards a 
message towards nodes whose nodelds share progressively 
longer prefixes with the message’s key. Since a Scribe tree 
is formed by the routes from all members to the groupld, 
the nodelds of all interior nodes share some number of dig- 
its with the tree’s groupId. Therefore, we can ensure that k 
Scribe trees have a disjoint set of interior nodes simply by 
choosing grouplds for the trees that all differ in the most 
significant digit. Figure 4 illustrates the construction. We 
call the grouplId of a stripe group the stripeld of the stripe. 

We can choose a value of b for Pastry that achieves the 
value of k suitable for a particular application. Setting 
2° = k ensures that each participating node has an equal 
chance of becoming an interior node in some tree. Thus, 
the forwarding load is approximately balanced. If b is cho- 
sen such that k = 2’, i < b, it is still possible to ensure this 
fairness by exploiting certain properties of the Pastry rout- 
ing table, but we omit the details due to space constraints. 
Additionally, it is fairly easy to change the Pastry implemen- 
tation to route using an arbitrary base that is not a power 
of 2. Without loss of generality, we assume that 2° = k in 
the rest of this paper. 


4.2 Limiting node degree 


The resulting forest of Scribe trees is interior-node-disjoint 
and satisfies the nodes’ constraints on the inbound band- 
width. To see this, observe that a node’s inbound band- 
width is proportional to the desired indegree, which is the 
number of stripes that the node chooses to receive. It is 


assumed that each node receives and forwards at least the 
stripe whose stripeld shares a prefix with its nodeld, be- 
cause the node may have to serve as an interior node for 
that stripe. 

However, the forest does not necessarily satisfy nodes’ 
constraints on outbound bandwidth; some nodes may have 
more children than their forwarding capacity. The number 
of children that attach to a node is bounded by its indegree 
in the Pastry overlay, which is influenced by the physical 
network topology. This number may exceed a node’s for- 
warding capacity if the node does not limit its outdegree. 

Scribe has a built-in mechanism (called “push-down”) to 
limit a node’s outdegree. When a node that has reached 
its maximal outdegree receives a request from a prospec- 
tive child, it provides the prospective child with a list of 
its current children. The prospective child then seeks to be 
adopted by the child with lowest delay. This procedure con- 
tinues recursively down the tree until a node is found that 
can take another child. This is guaranteed to terminate 
successfully with a single Scribe tree provided each node is 
required to take on at least one child. 

However, this procedure is not guaranteed to work in 
SplitStream. The reason is that a leaf node in one tree may 
be an interior node in another tree, and it may have already 
reached its outdegree limit with children in this other tree. 
Next, we describe how SplitStream resolves this problem. 


4.3 Locating parents 


The following algorithm is used to resolve the case where 
a node that has reached its outdegree limit receives a join 
request from a prospective child. First, the node adopts the 
prospective child regardless of the outdegree limit. Then, 
it evaluates its new set of children to select a child to re- 
ject. This selection is made in an attempt to maximize the 
efficiency of the SplitStream forest. 

First, the node looks for children to reject in stripes whose 
stripelds do not share a prefix with the local node’s nodeld. 
(How the node could have acquired such a child in the first 
place will become clear in a moment.) If the prospective 
child is among them, it is selected; otherwise, one is chosen 
randomly from the set. If no such child exists, the current 
node is an interior node for only one stripe tree, and it selects 
the child whose nodeld has the shortest prefix match with 
that stripeld. If multiple such nodes exist and the prospec- 
tive child is among them, it is selected; otherwise, one is 
chosen randomly from the set. The chosen child is then no- 
tified that it has been orphaned for a particular stripeld. 
This is exemplified in Figure 5. 

The orphaned child then seeks to locate a new parent 
in up to two steps. In the first step, the orphaned child 
examines its former siblings and attempts to attach to a 
random former sibling that shares a prefix match with the 
stripeld for which it seeks a parent. The former sibling 
either adopts or rejects the orphan, using the same criteria 
as described above. This “push-down” process continues 
recursively down the tree until the orphan either finds a new 
parent or no children share a prefix match with the stripeld. 
If the orphan has not found a parent the second step uses 
the spare capacity group. 


4.4 Spare capacity group 


If the orphan has not found a parent, it sends an anycast 
message to a special Scribe group called the spare capacity 
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Figure 5: Handling of prospective children by a node that has reached its outdegree limit. Circles represent 
nodes and the numbers close to them are their nodelds (* is a wildcard). Solid lines indicate that the bottom 
node is a child of the top node and the number close to the line is the stripeId. Dashed lines represent 
requests to join a stripe. The node with id 080* has reached its outdegree limit of 4. (1) Node 001* requests 
to join stripe 0800. (2) Node 080* takes 001* as a child and drops 9*, which was a child in stripe 1800 that 
does not share the first digit with 080*. (3) Then node 085* requests to join stripe 0800. (4) Node 080* 
takes 085* as a child and drops 001*, which has a shorter prefix match with stripe 0800 than other children. 


group. All SplitStream nodes that have less children in stripe 
trees than their forwarding capacity limit are members of 
this group. Scribe delivers this anycast message to a node 
in the spare capacity group tree that is near the orphan in 
the physical network. This node starts a depth-first search 
(DFS) of the spare capacity group tree by forwarding the 
message to a child. If the node has no children or they have 
all been checked, the node checks whether it receives one 
of the stripes which the orphaned child seeks to receive (in 
general, this is the set of stripe groups that the orphan has 
not already joined). If so, it verifies that the orphan is not 
an ancestor in the corresponding stripe tree, which would 
create a cycle. To enable this test, each node maintains its 
path to the root of each stripe that it receives. 

If both tests succeed, the node takes on the orphan as a 
child. If the node reaches its outdegree limit as a result, it 
leaves the spare capacity group. If one of the tests fails, the 
node forwards the message to its parent and the DFS of the 
spare capacity tree continues until an appropriate member 
is found. This is illustrated in Figure 6. 

The properties of Scribe trees and the DFS of the spare 
capacity tree ensure that the parent is near the orphan in 
the physical network. This provides low delay and low link 
stress. However, it is possible for the node to attach to a 
parent that is already an interior node in another stripe tree. 
If this parent fails, it may cause the temporary loss of more 
than one stripe for some nodes. We show in Section 5 that 
only a small number of nodes and stripes are affected on 
average. 

Anycasting to the spare capacity group may fail to locate 
an appropriate parent for the orphan even after an appropri- 
ate number of retries with sufficient timeouts. If the spare 
capacity group is empty, the SplitStream forest construction 
is infeasible because an orphan remains after all forwarding 
capacity has been exhausted. In this case, the application 
on the orphaned node is notified that there is no forwarding 
capacity left in the system. 

Anycasting can fail even when there are group members 
with available forwarding capacity in the desired stripe. This 
can happen if attaching the orphan to receive the stripe from 
any of these members causes a cycle because the member is 
the orphan itself or a descendant of the orphan. We solve 
this problem as follows. The orphan locates any leaf in the 
desired stripe tree that is not its descendant. It can do this 
by anycasting to the stripe tree searching for a leaf that is 


anycast 
for 6 


in: {0,3,A} in: {1,...,16} 
spare: 2..." spare: 4 


° 
° 


oor t tte ee 
... 


Figure 6: Anycast to the spare capacity group. 
Node 0 anycasts to the spare capacity group to find 
a parent for the stripe whose identifier starts with 
6. The request is routed by Pastry towards the root 
of the spare capacity group until it reaches node 1, 
which is already in the tree. Node 1 forwards the 
request to node 2, one of its children. Since 2 has no 
children, it checks if it can satisfy node 0’s request. 
In this case, it cannot because it does not receive 
stripe 6. Therefore, it sends the request back to 
node 1 and node 1 sends it down to its other child. 
Node 3 can satisfy 0’s request and replies to 0. 


not its descendant. Such a leaf is guaranteed to exist because 
we require a node to forward the stripe whose stripeld starts 
with the same digit as its nodeld. The orphan replaces the 
leaf on the tree and the leaf becomes an orphan. The leaf 
can attach to the stripe tree using the spare capacity group 
because this will not cause a cycle. 

Finally, anycasting can fail if no member of the spare ca- 
pacity group provides any of the desired stripes. In this case, 
we declare failure and notify the application. As we argue 
next, this is extremely unlikely to happen when the sufficient 
condition for forest construction (Condition 2) holds. 


4.5 Correctness and complexity 


Next, we argue informally that SplitStream can build a 
forest with very high probability provided the set of nodes N 
satisfies the sufficient condition for feasibility (Condition 2) 
and there is a modest amount of spare capacity in the sys- 
tem. The analysis assumes that all nodes in N join the forest 


at the same time and that communication is reliable to en- 
sure that all the spare capacity in the system is available in 
the spare capacity group. It also assumes that nodes do not 
leave the system either voluntarily or due to failures. Split- 
Stream includes mechanisms to deal with violations of each 
of these assumptions but these problems may block some 
parents that are available to forward stripes to orphans, 
e.g., they may be unreachable by an orphan due to com- 
munication failures. If these problems persist, SplitStream 
may be unable to ensure feasibility even if Condition 2 holds 
at every instant. We ignore these issues in this analysis but 
present simulation results that show SplitStream can cope 
with them in practice. 

The construction respects the bounds on forwarding ca- 
pacity and indegree because nodes reject children beyond 
their capacity limit and nodes do not seek to receive more 
stripes than their desired indegree. Additionally, there are 
no cycles by construction because an orphan does not attach 
to a parent whose path to the root includes the orphan. The 
issue is whether all nodes can receive as many distinct stripes 
as they desire. 

When node i joins the forest, it selects I; stripes uniformly 
at random. (The stripe whose stripeld starts with the same 
digit as is nodeld is always selected but the nodeld is se- 
lected randomly with uniform probability from the id space.) 
Then i joins the spare capacity tree advertising that it will 
be able to forward the selected stripes and attempts to join 
the corresponding stripe trees. We will next estimate the 
probability that the algorithm leaves an orphan that cannot 
find a desired stripe. 

There are two ways for a node į to acquire a parent for 
each selected stripe s: (1) joining the stripe tree directly 
without using the spare capacity group, or (2) anycasting to 
the spare capacity group. If s is the stripe whose stripeld 
starts with the same digit as 7’s nodeld, i is guaranteed to 
find a parent using (1) after being pushed down zero or more 
times and this may orphan another node. The algorithm 
guarantees that i never needs to use (2) to locate a parent 
on this stripe. The behavior is different when 7 uses (1) to 
locate a parent on another stripe; it may fail to find a parent 
but it will never cause another node to become an orphan. 

When a node i first joins a stripe s, it uses (1) to finda 
parent. If the identifiers of i and s do not share the first 
digit, 7 may fail to find a parent for s after being pushed 
down at most hs times (where hs is the height of s’s tree) 
but it does not cause any other node to become an orphan. 
If the trees are balanced we expect that hs is O(log|N]). 

If the identifiers of i and s share the same digit, i is guar- 
anteed to find a parent using (1) but it may orphan another 
node j on the same stripe or on a different stripe r. In this 
case, j attempts to use (1) to acquire a parent on the lost 
stripe. There are three sub-cases: (a) j looses a stripe r 
(r Æ s), (b) j looses stripe s and the identifiers of j and s 
do not share the first digit, and (c) j looses stripe s and the 
identifiers of j and s share the first digit. The algorithm 
ensures that in case (a), the first digit in j’s nodeId does 
not match the first digit in r’s stripeld. This ensures that 
this node j does not orphan any other node when it uses (1) 
to obtain a parent for r. Similarly, j will not orphan any 
other node in case (b). In cases (a) and (b), j either finds a 
parent or fails to find a parent after being pushed down at 
most h, or hs times. In case (c), we can view j as resuming 
the walk down s’s tree that was started by i. 


Therefore, in all cases, 2’s first join of stripe s results in at 
most one orphan j (not necessarily i = j) that uses anycast 
to find a parent for a stripe r (not necessarily r = s) after 
O(log|N|) messages. This holds even with concurrent joins. 

If an orphan j attempts to locate a parent for stripe r by 
anycasting to the spare capacity group, it may fail to find 
a node in the spare capacity group that receives stripe r. 
We call the probability of this event Py. It is also possible 
that all nodes in the spare capacity group that receive r 
are descendants of j. Our construction ensures that the 
identifiers of j and r do not share the first digit. Therefore, 
the expected number of descendants of 7 for stripe r should 
be O(1) and small if trees are well balanced. The technique 
that handles this case succeeds in finding a parent for j with 
probability one and leaves an orphan on stripe r that has no 
descendants for stripe r. In either case, we end up with a 
probability of failure Py and an expected cost of O(log|N]|) 
messages on success. 

We will compute an upper bound on Py. We start by 
assuming that we know the set 1,...,/ of nodes in the spare 
capacity group when the anycast is issued and their desired 
indegrees J, ..., I1. We can compute an exact value for Py 
with this information: 
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If all nodes join at least Imin stripes, we can compute an 
upper bound on Py that does not require knowledge of the 


desired indegrees of nodes in the spare capacity group: 


Imin Ll 
me) 

We can assume that each node 7 in the spare capacity 
group has spare capacity less than k; otherwise, Condition 2 
implies that J; = k and i can satisfy the anycast request. 
Since the spare capacity in the system C = iyicen Ci — 
view Li is the minimum capacity available in the spare 
capacity group at any time, | > C/(k — 1) and so 
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This bound holds even with concurrent anycasts because we 
use the minimum spare capacity in the system to compute 
the bound. 

There are at most |N| node joins and each node joins at 
most k stripes. Thus the number of anycasts issued during 
forest construction that may fail is bounded by |N| x k. 
Since the probability of A or B occurring is less than or 
equal to the probability of A plus the probability of B, the 
following is a rough upper bound on the probability that the 
algorithm fails to build a feasible forest 
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The probability of failure is very low even with a mod- 
est amount of spare capacity in the system. For example, 
the predicted probability of failure is less than 1071! with 
|N| = 1,000,000, k = 16, Imin = 1, and C = 0.01 x |N]. 
The probability of failure decreases when Imin increases, for 
example, it is 1071 in the same setting when Imin = 8 and 
C = 0.001 x |N|. When the desired indegree of all nodes 
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equals the total number of stripes, the algorithm never fails. 
In this case, it is guaranteed to build a forest if the sum of 
the desired indegrees does not exceed the sum of the for- 
warding capacities even when there is no spare capacity. 

Next, we consider the algorithmic complexity of the Split- 
Stream forest construction. The expected amount of state 
maintained by each node O(log|N|). The expected number 
of messages to build the forest is O(|N|log|N|) if the trees 
are well balanced or O(|N|?) in the worst case. We expect 
the trees to be well balanced if each node forwards the stripe 
whose identifier shares the first digit with the node’s identi- 
fier to at least two other nodes. 


5. EXPERIMENTAL EVALUATION 


This section presents results of experiments designed to 
evaluate both the overhead of forest construction in Split- 
Stream and the performance of multicasts using the forest. 
We ran large-scale experiments on a network simulation en- 
vironment and live experiments on the PlanetLab Internet 
testbed [1]. The results support our hypothesis that the 
overhead of maintaining a forest is low and that multicasts 
perform well even with high churn in the system. 


5.1 Experimental setup 
We start by describing the experimental setup. 


Network simulation: We used a packet-level, discrete- 
event network simulator. The simulator models the prop- 
agation delay on the physical links but it does not model 
queuing delay, packet losses, or cross traffic because mod- 
eling these would prevent large-scale network simulations. 
In the simulations we used three different network topology 
models: GATech, Mercator and CorpNet. Each topology has 
a set of routers and we ran SplitStream on end nodes that 
were randomly assigned to routers with uniform probability. 
Each end node was directly attached by two LAN links (one 
on each direction) to its assigned router. The delay on LAN 
links was set to lms. 

GATech is a transit-stub topology model generated by the 
Georgia Tech [38] random graph generator. This model has 
5050 routers arranged hierarchically. There are 10 transit 
domains at the top level with an average of 5 routers in each. 
Each transit router has an average of 10 stub domains at- 
tached, and each stub has an average of 10 routers. The link 
delays between routers are computed by the graph generator 
and routing is performed using the routing policy weights of 
the graph generator. We generated 10 different topologies 
using the same parameters but different random seeds. All 
GATech results are the average of the results obtained by 
running the experiment in each of these 10 topologies. We 
did not attach end nodes to transit routers. 

Mercator is a topology model with 102,639 routers, obtained 
from measurements of the Internet using the Mercator sys- 
tem [21]. The authors of [37] used measured data and some 
simple heuristics to assign an autonomous system to each 
router. The resulting AS overlay has 2,662 nodes. Routing 
is performed hierarchically as in the Internet. A route fol- 
lows the shortest path in the AS overlay between the AS of 
the source and the AS of the destination. The routes within 
each AS follow the shortest path to a router in the next AS 
of the AS overlay path. We use the number of IP hops as a 
proxy for delay because there is no link delay information. 
CorpNet is a topology model with 298 routers and was gener- 


ated using measurements of the world-wide Microsoft corpo- 
rate network. Link delays between routers are the minimum 
delay measured over a one month period. 

Due to space constraints, we present results only for the 
GATech topology for most experiments but we compare 
these results with those obtained with the other topologies. 


SplitStream configuration: The experiments ran Pastry 
configured with b = 4 and a leaf set with size l = 16. The 
number of stripes per SplitStream multicast channel was 
k = 2 = 16. 

We evaluated SplitStream with six different configurations 
of node degree constraints. The first four are designed to 
evaluate the impact on overhead and performance of vary- 
ing the spare capacity in the system. We use the notation 
x x y to refer to these configurations where x is the value 
of the desired indegree for all nodes and y is the forwarding 
capacity of all nodes. The four configurations are: 16 x 16, 
16x18, 16x32 and 16x NB. The 16x 16 configuration has a 
spare capacity of only 16 because the roots of the stripes do 
not consume forwarding capacity for the stripes they orig- 
inate. The 16 x 18 and 16 x 32 configurations provide a 
spare capacity of 12.5% and 100%, respectively. Nodes do 
not impose any bound on forwarding capacity in the 16x NB 
configuration. SplitStream is able to build a forest in any 
of these configurations with probability 1 (as shown by our 
analysis). 

The last two configurations are designed to evaluate the 
behavior of SplitStream when nodes have different desired 
indegrees and forwarding capacities. The configurations are 
d x d and Gnutella. In d x d, each node has a desired inde- 
gree equal to its forwarding capacity and it picks stripes to 
receive as follows: it picks the stripe whose stripeld has the 
same first digit as its nodeld with probability 1 and it picks 
each of the other stripes with probability 0.5. The analysis 
predicts a low probability of success when building a forest 
with the d x d configuration because there is virtually no 
spare capacity in the system and nodes do not receive all 
stripes. But SplitStream was able to build a forest success- 
fully in 9 out of 10 runs. 

The Gnutella configuration is derived from actual mea- 
surements of inbound and outbound bandwidth of Gnutella 
peers [34]. The CDF of these bandwidths is shown in Fig- 
ure 7. This distribution was sampled to generate pairs of in- 
bound (bin) and outbound (bout) bandwidths for each Split- 
Stream node. Many nodes have asymmetric bandwidth as 
is often the case with DSL or cable modem connections. 
We assumed a total bandwidth for a stream of 320Kbps 
with 20Kbps per stripe. The desired indegree of a node was 
set to max(1, min(bin/20K bps, 16)) and the forwarding ca- 
pacity to min(bout/20K bps, 32). Approximately, 90% of the 
nodes have a desired indegree of 16. The spare capacity in 
the system is 6.9 per node. The analysis predicts a suc- 
cess probability of almost 1 when building a forest with this 
configuration. 


SplitStream Implementation: The SplitStream imple- 
mentation used in the simulation experiments has three op- 
timisations that were omitted in Section 4 for clarity. 

The first optimisation reduces the overhead of cycle detec- 
tion during forest construction. As described in Section 4, 
each node maintains the path to the root of each stripe that 
it receives. Maintaining this state incurs a considerable com- 
munication overhead because updates to the path are multi- 
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links of Gnutella peers. 


cast to all descendants. We can avoid maintaining this state 
by taking advantage of the prefix routing properties of Pas- 
try. Cycles are possible only if some parent does not share 
a longer prefix with a stripeld than one of its children for 
that stripe. Therefore, nodes need to store and update path 
information only in such cases. This optimisation reduces 
the overhead of forest construction by up to 40%. 

The second optimisation improves anycast performance. 
When a node joins the SplitStream forest, it may need to 
perform several anycasts to find a parent for different stripes. 
Under the optimization, these anycasts are batched; the 
node uses a single anycast to find parents for multiple stripes. 
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Figure 8: Cumulative distribution of node stress 
during forest construction with 40,000 nodes on 
GATech. 


the spare capacity in the system increases. With more spare 
capacity, nodes are orphaned less often and so there are less 
pushdowns and anycasts. The 16 x NB configuration has 
the lowest node stress because there are no pushdowns or 
anycasts. The 16 x 16 configuration has the highest node 
stress because each node uses an anycast to find a parent for 
8 stripes on average. The nodes with the maximum node 
stress in all configurations (other than 16 x NB) are those 
with nodelds closest to the identifier of the spare capacity 
group. Table 1 shows that increasing the spare capacity of 
the 16 x 16 configuration by only 12.5% results in a factor 
of 2.7 decrease in the maximum node stress. 


This optimization can reduce the number of anycasts per- 
formed during forest construction by up to a factor of eight. 
The third optimization improves the DFS traversal of the 
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anycast tree. A parent adds the list of its children to an 
anycast message before forwarding the message to a child. If 
the child is unable to satisfy the anycast request, it removes 
itself from the list and sends the message to one of its siblings 
(avoiding another visit to the parent). 


5.2 Forest construction overhead 


The first set of experiments measured the overhead of for- 
est construction without node failures. They started from 
a Pastry overlay with 40,000 nodes and built a SplitStream 
forest with all overlay nodes. All the nodes joined the spare 
capacity and the stripe groups at the same time. The over- 
heads would be lower with less concurrency. 

We used two metrics to measure the overhead: node stress 
and link stress. Node stress quantifies the load on nodes. 
A node’s stress is equal to the number of messages that it 
receives. Link stress quantifies the load on the network. The 
stress of a physical network link is equal to the number of 
messages sent over the link. 


Node stress: Figures 8 and 9 show the cumulative distribu- 
tion of node stress during forest construction with different 
configurations on the GATech topology. Figure 8 shows re- 
sults for the 16 x y configurations and Figure 9 shows results 
for dx d and Gnutella. A point (x,y) in the graph indicates 
that a fraction y of all the nodes in the topology has node 
stress less than or equal to x. Table 1 shows the maximum, 
mean and median node stress for these distributions. The 
results were similar on the other topologies. 

Figure 8 and Table 1 show that the node stress drops as 


Table 1: Maximum, mean and median node stress 
during forest construction with 40,000 nodes on 
GATech. 


Figure 9 shows that the node stress is similar with Gnutella 
and 16 x 16. Gnutella has a significant amount spare capac- 
ity but not all members of the spare capacity group receive 
all stripes, which increases the length of the DFS traversals 
of the anycast tree. d x d has the same spare capacity as 
16 x 16 but it has lower node stress because nodes join only 
9 stripe groups on average instead of 16. 


~dxd 
— Gnutella 
— 16x16 


1 10 100 1000 10000 
Node Stress 


Figure 9: Cumulative distribution of node stress 
during forest construction with 40,000 nodes on 
GATech. 


We also ran experiments to evaluate the node stress dur- 
ing forest construction for overlays with different sizes. Fig- 
ure 10 shows the mean node stress with the 16 x x configura- 
tions. The maximum and median node stress show similar 
trends as the number of nodes increases. 
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Figure 10: Mean node stress during forest construc- 
tion with varying number of nodes on GATech. 


From the analysis, we would expect the mean node stress 
to be O(logn), where n is the number of nodes. But there 
is a second effect that is ignored in the analysis; the average 
cost of performing an anycast and joining a stripe group 
decreases with the group size. The first effect dominates 
when n is small and the second effect compensates the first 
for large n. This suggests that our complexity analysis is 
too pessimistic. 

The results demonstrate that the node stress during for- 
est construction is low and is largely independent of the 
number of nodes in the forest. It is interesting to contrast 
SplitStream with a centralized server distributing an s-byte 
file to a large number of clients. With 40,000 clients, the 
server will have a node stress of 40,000 handling requests 
and will send a total of s x 40,000 bytes. The maximum 
node stress during SplitStream forest construction with the 
16 x 16 configuration is 13.5 times lower, and the number of 
bytes sent by each node to multicast the file over the forest 
is bounded by s. 


Link stress: Figure 11 shows the cumulative distribution 
of link stress during forest construction with different config- 
urations on GATech. A point (x,y) in the graph indicates 
that a fraction y of all the links in the topology has link 
stress less than or equal to x. Table 2 shows the maximum, 
mean and median link stress for links with non-zero link 
stress. It also shows the fraction of links with non-zero link 
stress (links). Like node stress, the link stress drops as the 
spare capacity in the system increases and the reason is the 
same. The links with the highest stress are transit links. 

The results were similar on the other topologies: the me- 
dians were almost identical and the averages were about 10% 
lower in CorpNet and 10% higher in Mercator. The differ- 
ence in the averages is explained by the different ratio of the 
number of LAN links to the number of router-router links 
in the topologies. This ratio is highest in CorpNet and it is 
lowest in Mercator. 

We conclude that the link stress induced during forest 
construction is low on average. The maximum link stress 
across all configurations is at least 6.8 times lower than the 
maximum link stress in a centralized system with the same 
number of nodes. 
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Figure 11: Cumulative distribution of link stress 
during forest construction with 40,000 nodes on 
GATech. 
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Table 2: Maximum, mean, and median link stress 
for used links and fraction of links (Links) used dur- 
ing forest construction with 40,000 nodes on GAT- 
ech. 


5.3 Forest multicast performance 


We also ran experiments to evaluate the performance of 
multicast over a SplitStream forest without failures. To put 
this performance in perspective, we compared SplitStream, 
IP multicast, and Scribe. IP multicast represents the best 
that can be achieved with router support and Scribe is a 
state-of-the-art, application-level multicast system that uses 
a single tree. We used three metrics to evaluate multicast 
performance in the comparison: node stress, link stress, and 
delay penalty relative to IP. 

The experiments ran on the 40,000-node Pastry overlay 
that we described in the previous section. Our implemen- 
tation of IP multicast used a shortest path tree formed by 
the merge of the unicast routes from the source to each re- 
cipient. This is similar to what could be expected in our 
experimental setting using protocols like Distance Vector 
Multicast Routing Protocol (DVMRP) [18]. We created a 
single group for IP multicast and Scribe that all overlay 
nodes joined. The tree for the Scribe group was created 
without the bottleneck remover [13]. 

To evaluate SplitStream, we multicast a data packet to 

each of the 16 stripe groups of a forest with 40,000 nodes 
(built as described in the previous section). For both Scribe 
and IP multicast, we multicast 16 data packets to the single 
group. 
Node stress: During this multicast experiment, the node 
stress of each SplitStream node is equal to its desired inde- 
gree, for example, 16 in the 16 x x configurations or less in 
others. Scribe and IP multicast also have a node stress of 16. 
The number of messages sent by each node in SplitStream 
is also bounded by the forwarding capacity of each node, for 
example, it is 16 in the 16 x 16 configuration. This is im- 
portant because it enables nodes with different capabilities 
to participate in the system. 


Link stress: We start by presenting results of experiments 
that measured link stress during multicast with different 
SplitStream forest sizes. Table 3 shows the results of these 
experiments with the 16x 16 configuration in GATech. When 
a SplitStream node is added to the system, it uses two ad- 
ditional LAN links (one on each direction) and it induces 
a link stress of 16 in both. Adding nodes also causes an 
increase on the link stress of router-router links because the 
router network remains fixed. Since the majority of links 
for the larger topologies are LAN links, the median link 
stress remains constant and the mean link stress grows very 
slowly. The maximum link stress increases because the link 
stress in router-router links increases. This problem affects 
all application-level multicast systems. 
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Table 3: Maximum, mean and median link stress for 
used links and fraction of links (Links) used during 
multicast with the 16 x 16 configuration on GATech 
with varying number of nodes. 


The next set of experiments compared the link stress dur- 
ing multicast with different SplitStream configurations and 
40,000 nodes on GATech. Table 4 shows the maximum, 
mean and median link stress for used links, and the fraction 
of links used in these experiments. The results show that 
the link stress tends to decrease when the spare capacity 
increases. However, the absence of bounds on forwarding 
capacity in (16 x NB) causes a concentration of stress in a 
smaller number of links, which results in increased average 
and maximum stress for used links. The average link stress 
in d x d and Gnutella is lower because nodes receive less 


stripes on average. 


Table 4: Maximum, mean, and median link stress 
for used links and fraction of links (Links) used dur- 
ing multicast with different SplitStream configura- 
tions and 40,000 nodes on GATech. 


We picked the SplitStream configuration that performs 
worst (16 x 16) and compared its performance with Scribe, 
IP multicast, and a centralized system using unicast. Fig- 
ure 12 shows the cumulative distribution of link stress during 
multicast for the different systems on GATech with 40,000 
nodes. A point (x,y) in the graph indicates that a fraction 
y of all the links in the topology has link stress less than or 
equal to x. Table 5 shows the maximum, mean and median 
link stress for used links, and the fraction of links used. 

The results show that SplitStream uses a significantly 
larger fraction of the links in the topology to multicast mes- 
sages than any of the other systems: SplitStream uses 98% 
of the links in the topology, IP multicast and the central- 
ized unicast use 43%, and Scribe uses 47%. This is mostly 
because SplitStream uses both outbound and inbound LAN 
links for all nodes whereas IP multicast and the centralized 


[Cont] centralized | Scribe | IP] 16X16] 
[ Max || 639954 | 3990 
[Mean | 12859 | 39.6 |16 | 295 | 


[Median [16 | 16 [16 [16 
[Links [43 | 47 | 8[ 38 


Table 5: Maximum, mean, and median link stress 
for used links and fraction of links (Links) used by 
centralized unicast, Scribe, IP, and SplitStream mul- 
ticast with 40,000 nodes on GATech. 
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Figure 12: Cumulative distribution of link stress 
during multicast with 40,000 nodes on GATech. 


unicast only use inbound LAN links. Scribe uses all inbound 
LAN links but it only uses outbound LAN links for a small 
fraction of nodes (the interior nodes in the tree). We ob- 
served the same behavior on the other topologies. 

SplitStream loads the links it uses less than Scribe. Ta- 
ble 5 shows that the average stress in links used by Split- 
Stream is close to the average stress in links used by IP 
multicast (28% worse). The average link stress in links used 
by Scribe is 247% worse than that achieved by IP multi- 
cast. SplitStream also achieves a maximum link stress that 
is almost a factor of 3 lower than Scribe’s. The compar- 
ison would be even more favorable to SplitStream with a 
configuration with more spare capacity. 

We also observed this behavior in the other topologies. 
The average stress in links used by SplitStream is at most 
13% worse than the average stress in links used by IP mul- 


ticast on CorpNet and at most 37% worse in Mercator. The 
average link stress in links used by Scribe is 203% worse 
than that achieved by IP multicast in CorpNet and 337% 
worse in Mercator. 

We conclude that SplitStream can deliver higher band- 
width than Scribe by using available bandwidth in node ac- 
cess links in both directions. This is a fundamental advan- 
tage of SplitStream relative to application level multicast 
systems that use a single tree. 


Delay penalty: Next, we quantify the delay penalty of 
SplitStream relative to IP multicast. We computed two met- 
rics of delay penalty for each SplitStream stripe: RMD and 
RAD. To compute these metrics, we measured the distribu- 
tion of delays to deliver a message to each member of a stripe 
group using both the stripe tree and IP multicast. RMD is 
the ratio between the maximum delay using the stripe tree 
and the maximum delay using IP multicast, and RAD is the 
ratio between the average delay using the stripe tree and the 
average delay using IP multicast. 

Figures 13 and 14 show RAD results for SplitStream mul- 
ticast with different configurations with 40,000 nodes. They 
present the cumulative RAD distribution for the 16 stripes. 
A point (x,y) in the graph indicates that a fraction y of the 
16 stripes has RAD less than or equal to y. The results for 
RMD were qualitatively similar and the maximum RMD for 
the configurations in the two figures was 5.4 in GATech and 
4.6 in CorpNet. 


16 
x 
14 
x 
gl2 x RAD (16 x NB) 
a x 
S40 RAD (16 x 32) l 
v = RAD (16 x 18) x 
2 8 — RAD (16 x 16) 
& x 
je 
Z6 
psi x 
O 4 $ 
x 
2 % 
oi , 
0 0.5 1 15 2 25 


Delay penalty 


Figure 13: Cumulative distribution of delay penalty 
with 16 x x configurations on the GATech topology 
with 40,000 nodes. 


Figure 13 shows that the delay penalty increases when the 
spare capacity in the system decreases. The 16 x NB con- 
figuration provides the lowest delay because nodes are never 
orphaned. The mechanisms to deal with orphaned nodes 
(push down and anycast) increase the average depth of the 
stripe trees. Additionally, the bounds on outdegree may 
prevent children from attaching to the closest prospective 
parent in the network. To put these results into context, if 
the stripe trees were built ignoring physical network prox- 
imity, the average RAD would be approximately 3.8 for the 
16 x NB configuration and significantly higher for the other 
configurations. 

Figure 14 shows that the RAD distribution is qualitatively 
similar in GATech and CorpNet for three representative con- 
figurations. We do not present results for Mercator because 
we do not have delay values for the links in this topology. 

The delay penalty tends to be lower in CorpNet because 
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Figure 14: Cumulative distribution of delay penalty 
on GATech and CorpNet with 40,000 nodes. 


Pastry’s mechanism to exploit proximity in the network is 
more effective in CorpNet than in GATech [12]. The figure 
also shows that the delay penalty with Gnutella and with 
16 x 16 is similar for most stripes. However, the variance 
is higher in Gnutella because nodes with a very small for- 
warding capacity can increase the average depth of a stripe 
tree significantly if they are close to the top of the tree. The 
delay penalty is higher with d x d than with the other two 
configurations because the average forwarding capacity is 
lower (only 9), which results in deeper stripe trees. 

We conclude that SplitStream achieves delay that is within 
a small factor of optimal. This performance should be suffi- 
cient even for demanding applications that have an interac- 
tive component, e.g., large virtual class rooms. 


5.4 Resilience to node failures 


Previous experiments evaluated the performance of Split- 
Stream without failures. This section presents results to 
quantify performance and overhead with node failures. The 
performance metric is the fraction of the desired stripes re- 
ceived by each node, and the overhead metric is the number 
of control messages sent per second by each node (similar to 
node stress). 


Path diversity: The first set of experiments analyzed the 
paths from each node į to the root of each stripe that i re- 


ceives. If all these paths are node-disjoint, a single node fail- 
ure can deprive i of at most one stripe (until the stripe tree 
repairs). SplitStream may fail to guarantee node-disjoint 
paths when nodes use anycast to find a parent but this is 
not a problem as the next results show. 


Mex | 68 | 66 | 53] 1 | 72 | 


Table 6: Worst case maximum, mean, and median 
number of stripes lost at each node when a single 
node fails. 


In a 40,000-node SplitStream forest on GATech, the mean 
and median number of lost stripes when a random node fails 
is 1 for all configurations. However, nodes may loose more 
than one stripe when some nodes fail. Table 6 shows the 
max, median, and mean number of stripes lost by a node 
when its worst case ancestor fails. The number of stripes 
lost is very small for most nodes, even when the worst case 
ancestor fails. This shows that SplitStream is very robust 
to node failures. 


Catastrophic failures: The next experiment evaluated 
the resilience of SplitStream to catastrophic failures. We 
created a 10,000-node SplitStream forest with the 16 x 16 
configuration on GATech and started multicasting data at 
the rate of one packet per second per stripe. We failed 2,500 
nodes 10s into the simulation. 

Both Pastry and SplitStream use heartbeats and probes 
to detect node failures. Pastry used the techniques described 
in [25] to control probing rates; it was tuned to achieve 1% 
loss rate with a leaf set probing period of 30s. SplitStream 
nodes send heartbeats to their children and to their parents. 
The heartbeats sent to parents allow nodes to detect when 
a child fails so they can rejoin the spare capacity group. We 
configured SplitStream nodes to send these hearbeats every 
30s. In both Pastry and SplitStream, heartbeats and probes 
are suppressed by other traffic. 

Figure 15 shows the maximum, average, and minimum 
number of stripes received by each node during the experi- 
ment. Nodes loose a large number of stripes with the large 
scale failure but SplitStream and Pastry recover quickly. 
Most nodes receive packets on at least 14 stripes after 30s 
(one failure detection period) and they receive all stripes af- 
ter 60s. Furthermore, all nodes receive all stripes after less 
than 3 minutes. 

Figure 16 breaks down the average number of messages 
sent per second per node during the experiment. The line 
labeled Pastry shows the Pastry overhead and the line la- 
beled Pastry+SplitStream represents the total overhead, i.e., 
the average number of Pastry and SplitStream control mes- 
sages. The figure also shows the total number of messages 
including data packets (labeled Pastry+SplitStream+ Data). 

The results show that nodes send only 1.6 control mes- 
sages per second before the failure. This overhead is very 
low. 

The overhead increases after the failure while Pastry and 
SplitStream repair. Even after repair, the overhead is higher 
than before the failure because Pastry uses a self-tuning 
mechanism to adapt routing table probing rates [25]. This 
mechanism detects a large number of failures in a short pe- 
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Figure 15: Maximum, average, and minimum num- 
ber of stripes received when 25% out of 10,000 nodes 
fail on GATech. 
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Figure 16: Number of messages per second per node 
when 25% out of 10,000 nodes fail on GATech. 


riod and increases the probing rate (to the maximum in this 
case). Towards the end of the trace, the overhead starts to 
dip because the self-tuning mechanism forgets the failures it 
registered. 


High churn: The final simulation experiment evaluated the 
performance and overhead of SplitStream with high churn 
in the overlay. We used a real trace of node arrivals and 
departures from a study of Gnutella [34]. The study mon- 
itored 17,000 unique nodes in the Gnutella overlay over a 
period of 60 hours. It probed each node every seven min- 
utes to check if it was still part of the Gnutella overlay. The 
average session time over the trace was approximately 2.3 
hours and the number of active nodes in the overlay varied 
between 1300 and 2700. Both the arrival and departure rate 
exhibit large daily variations. 

We ran the experiment on the GATech topology. Nodes 
joined a Pastry overlay and failed according to the trace. 
Twenty minutes into the trace, we created a SplitStream 
forest with all the nodes in the overlay at that time. We used 
the 16 x 20 configuration. From that time on, new nodes 
joined the Pastry overlay and then joined the SplitStream 
forest. We sent a data packet to each stripe group every 10 
seconds. Figure 17 shows the average and 0.5th percentile of 
the number of stripes received by each node over the trace. 

The results show that SplitStream performs very well even 
under high churn: 99.5% of the nodes receive at least 75% 
of the stripes (12) almost all the time. The average number 
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Figure 17: Number of stripes received with the 
Gnutella trace on GATech. 


of stripes received by each node is between 15 and 16 most 
of the time. 

Figure 18 shows the average number of messages per sec- 
ond per node in the Gnutella trace. The results show that 
nodes send less than 1.6 control messages per second most 
of the time. The amount of control traffic varies mostly 
because Pastry’s self-tuning mechanism adjusts the probing 
rate to match the observed failure rate. The number of con- 
trol messages never exceeds 4.5 per second per node in the 
trace. Additionally, the vast majority of these messages are 
probes or heartbeats, which are small (50B on the wire). 
Therefore, the overhead is very low, for example, control 
traffic never consumes more than 0.17% of the total band- 
width used with a 1Mb/s data stream. 
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Figure 18: Number of messages per second per node 
with the Gnutella trace on GATech. 


5.5 PlanetLab results 


Finally, we present results of live experiments that evalu- 
ated SplitStream in the PlanetLab Internet testbed [1]. The 
experiments ran on 36 hosts at different PlanetLab sites 
throughout the US and Europe. Each host ran two Split- 
Stream nodes, which resulted in a total of 72 nodes in the 
SplitStream forest. We used the 16 x 16 configuration with 
a data stream of 320Kbits/sec. A 20Kbit packet with a se- 
quence number was multicast every second on each stripe. 
Between sequence numbers 32 and 50, four hosts were ran- 
domly selected and the two SplitStream nodes running on 
them were killed. This caused a number of SplitStream 
nodes to loose packets from one or more stripes while the 
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Figure 19: Number of stripes received versus se- 
quence number. 
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Figure 20: Packet delays versus sequence number. 


affected stripe trees were repaired. 

Figure 19 shows the average and minimum number of 
stripe packets received by any live SplitStream node with 
a particular sequence number. The effect of the failed nodes 
can be clearly seen. The repair time is determined primarily 
by SplitStream’s failure detection period, which triggers a 
tree repair when no heartbeats or data packets have been 
received for 30 seconds. As can be seen in Figure 19, once a 
failure has been detected SplitStream is able to find a new 
parent rapidly. 

Figure 20 shows the minimum, average and maximum de- 
lay for packets received with each sequence number at all 
SplitStream nodes. The delay includes the time taken to 
send the packet from a source to the root of each stripe 
through Pastry. Finally, Figure 21 shows the CDF of the 
delays experienced by all packets received independent of 
sequence number and stripe. 

The delay spikes in Figure 20 are caused by congestion 
losses in the Internet!. However, Figure 21 shows that 90% 
of packets experience a delay of under 1 second. Our re- 
sults show that SplitStream works as expected in an actual 
deployment, and that it recovers from failures gracefully. 


'TCP is used as the transport protocol; therefore, lost pack- 
ets cause retransmission delays. A transport protocol that 
does not attempt to recover packet losses could avoid these 
delays. 
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Figure 21: Cumulative distribution of packet delays. 
5% of the packets experience a delay of greater then 
2 seconds and the maximum observed delay is 11.7 
seconds. 


6. RELATED WORK 


Many application-level multicast systems have been pro- 
posed [16, 22, 32, 40, 13, 6, 23]. All are based on a single 
multicast tree per sender. Several systems use end-system 
multicast for video distribution, notably Overcast [22] and 
SpreadIt [7]. SplitStream differs in that it distributes for- 
warding load over all participants using multiple multicast 
trees, thereby reducing the bandwidth demands on individ- 
ual peers. 

Overcast organizes dedicated servers into a source-rooted 
multicast tree using bandwidth estimation measurements to 
optimize bandwidth usage across the tree. The main differ- 
ences between Overcast and SplitStream are (i) that Over- 
cast uses dedicated servers while SplitStream utilises clients; 
(ii) Overcast creates a single bandwidth optimised multicast 
tree, whereas SplitStream assumes that the network band- 
width available between peers is limited by their connections 
to their ISP rather than the network backbone. This sce- 
nario is typical because most backbone links are over provi- 
sioned. 

Nguyen and Zakhor [29] propose streaming video from 
multiple sources concurrently, thereby exploiting path di- 
versity and increasing tolerance to packet loss. They subse- 
quently extend the work in [29] to use Forward Error Cor- 
rection [9] encodings. The work assumes that the client is 
aware of the set of servers from which to receive the video. 
SplitStream constructs multiple multicast trees in a decen- 
tralized fashion and is therefore more scalable. 

Apostolopoulos [4, 5] originally proposed utilising striped 
video and MDC to exploit path diversity for increased ro- 
bustness to packet loss. They propose building an overlay 
composed of relays, and having each stripe delivered to the 
client using a different source. The work examines the per- 
formance of the MDC, but does not describe an infrastruc- 
ture to actually forward the stripes to the clients. Coop- 
Net [30] implements such a hybrid system, which utilises 
multiple trees and striped video using MDC. When the video 
server is overloaded, clients are redirected to other clients, 
thereby creating a distribution tree routed at the server. 
There are two fundamental differences between CoopNet 
and SplitStream: (i) CoopNet uses a centralised algorithm 
(running on the server) to build the trees while SplitStream 
is completely decentralised; and (ii) CoopNet does not at- 
tempt to manage the bandwidth contribution of individual 


nodes. However, it is possible to add this capability to Coop- 
Net. 

FCast [20] is a reliable file transfer protocol based on IP 
multicast. It combines a Forward Error Correction [9] en- 
coding and a data carousel mechanism. Instead of relying on 
IP multicast, FCast could be easily built upon SplitStream, 
for example, to provide software updates cooperatively. 

In [10, 26], algorithms and content encodings are described 
that enable parallel downloads and increase packet loss re- 
silience in richly connected, collaborative overlay networks 
by exploiting downloads from multiple peers. 


7. CONCLUSIONS 


We presented the design and evaluation of SplitStream, 
an application-level multicast system for high-bandwidth 
data dissemination in cooperative environments. The sys- 
tem stripes the data across a forest of multicast trees to bal- 
ance forwarding load and to tolerate faults. It is able to dis- 
tribute the forwarding load among participating nodes while 
respecting individual node bandwidth constraints. When 
combined with redundant content encoding, SplitStream of- 
fers resilience to node failures and unannounced departures, 
even while the affected multicast tree is repaired. The over- 
head of forest construction and maintenance is modest and 
well balanced across nodes and network links, even with 
high churn. Multicasts using the forest do not load nodes 
beyond their bandwidth constraints and they distribute the 
load more evenly over the network links than application- 
level multicast systems that use a single tree. 

The simulator and the versions of Pastry and SplitStream 
that ran on top of it are available upon request from Mi- 
crosoft Research. The version of Pastry and SplitStream 
that ran on PlanetLab is available for download from: http: 
//wiw.cs.rice.edu/CS/Systems/Pastry/FreePastry 
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